Member rate £492.50
Non-Member rate £985.00
Save £45: loyalty discount applied automatically*
Save 5% on each additional course booked
*If you attended our Methods School in Summer/Winter 2024.
Monday 31 July - Friday 4 August
09:00-12:30
Please see Timetable for full details.
The course provides a set of tools that can be employed when standard OLS estimation does not produce adequate estimates. Weighted Least Squares and cluster-corrected standard errors are offered as solutions to heteroskedasticity, which can severely distort estimates of uncertainty in standard OLS. Interactions help in instances where we have reason to suspect that effects vary across subgroups in the population. Nonlinear regression handles situations where the relationship between two variables takes a more complex form than a simple straight line. Finally, certain types of robust regression serve to produce proper estimates even in the presence of data outliers. All these topics are discussed both theoretically, in the lectures, and practically, with the use of actual social science data sets and the R statistical environment. The topics represent a middle ground between the standard linear regression framework and more advanced GLM procedures or the multilevel modeling framework.
Constantin Manuel Bosancianu is a postdoctoral researcher in the Institutions and Political Inequality unit at Wissenschaftszentrum Berlin.
His work focuses on the intersection of political economy and electoral behaviour: how to measure political inequalities between citizens of developed and developing countries, and what the linkages between political and economic inequalities are.
He is interested in statistics, data visualisation, and the history of Leftist parties. Occasionally, he teaches methods workshops on regression, multilevel modelling, or R.
As an estimation framework, OLS (Ordinary Least Squares) has clear advantages over alternatives when it comes to ease of understanding, speed, elegance and robustness in the face of distributional “irregularities” in data. At the same time, the instances when OLS can safely be employed in a standard way with social science data are rather few and far between. This class aims to expand the statistical toolbox of participants by presenting a set of regression tools that can be applied in situations when standard OLS would lead to suboptimal results.
We start with an in-depth coverage of OLS assumptions, emphasizing in particular the need for continuous (or dichotomous) measures in our regressions, a normal distribution of residuals, homoskedasticity, and linear relationships. We discuss, in turn, how OLS estimates of effect and uncertainty are impacted by violations of these assumptions, and what tools we have available in R to diagnose these problems. I make the point that these assumptions are frequently not met in the course of many analyses, leading to biased estimates and, therefore, shaky conclusions. The topics that follow in the course represent a few of the modeling strategies available to researchers when such assumptions are clearly invalid. While more complex than OLS, they are nevertheless still part of the “least squares” framework, and somewhat easier to apply than more complex procedures, like multilevel modeling.
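The diagnostic workflow covered in the Monday lab can be previewed with a minimal sketch. This is illustrative only, not course code: it uses R's built-in mtcars data rather than the social science data sets used in class, and assumes the lmtest package is installed.

```r
# Illustrative only: a simple OLS fit on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Base R's standard diagnostic plots: residuals vs fitted (linearity),
# Q-Q plot (normality of residuals), scale-location (homoskedasticity),
# and residuals vs leverage (influential observations)
par(mfrow = c(2, 2))
plot(fit)

# Formal tests: Shapiro-Wilk on the residuals (normality) and
# Breusch-Pagan (homoskedasticity; requires install.packages("lmtest"))
shapiro.test(residuals(fit))
lmtest::bptest(fit)
```

The plots are usually the first stop; the formal tests complement, rather than replace, the visual inspection.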
We first discuss heteroskedasticity: what its implications are for estimates, how it can be detected in the course of a standard analysis, and how commonly it appears as a problem. To address this issue, I present two potential solutions. The first is cluster-corrected standard errors, which can remedy estimates of uncertainty when the underlying problem is generated by clustering in the data. Cluster-corrected SEs continue to be a very popular approach in a variety of disciplines, which is why they are covered in depth here. The second, more general, solution is Weighted Least Squares (WLS). Both subtopics are discussed from a theoretical perspective, as well as in a practical setting, in the laboratory.

The third day of the course is taken up by the issue of effect heterogeneity across different subpopulations in the sample. In practice, this involves an in-depth discussion of interactions in linear models. We cover two-way and three-way interactions, for both continuous and dichotomous predictors, as well as how to present marginal effects graphically. As we will see, interactions are a frequent source of confusion in published work and continue to be misinterpreted. In the final part of the day, we discuss how the interaction framework can be applied when the subpopulations are actually samples from different countries, through the use of fixed effects. As in the previous days, the theoretical coverage is followed by applied lab work, using R and empirical data.
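The two heteroskedasticity remedies can be sketched in a few lines of R. This is a hedged illustration, not course material: it assumes the sandwich and lmtest packages are installed, uses built-in mtcars data, and lets cyl stand in for a genuine clustering variable.

```r
library(sandwich)  # vcovCL(): cluster-corrected covariance matrices
library(lmtest)    # coeftest(): pair a fit with an alternative vcov

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cluster-corrected (Huber-White type) standard errors, clustered on cyl
coeftest(fit, vcov = vcovCL(fit, cluster = ~ cyl))

# Weighted Least Squares: here we assume, purely for illustration, that
# the error variance is proportional to wt, so the weights are 1 / wt
wls <- lm(mpg ~ wt + hp, data = mtcars, weights = 1 / wt)
summary(wls)
```

Note the division of labour: the cluster correction leaves the coefficients untouched and adjusts only the uncertainty estimates, while WLS re-estimates the coefficients themselves under the assumed variance structure.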
We continue, in the fourth session, with how to model non-linear relationships between variables. Rather than use statistical transformations of the predictors, we rely in this section on polynomials and regression splines. I show how these tools allow the researcher to model increasingly complex relationships between predictors and outcomes. We conclude the section with a presentation of nonlinear least squares as an estimation method.

The final section is taken up by the topic of robust regression. Unlike the second day, though, here we refer to regression estimates robust to data outliers, rather than robust to violations of homoskedasticity. Data outliers can severely impact both the magnitude of the regression coefficients and the standard errors produced. We cover three types of procedures in this session: M estimation, bounded-influence regression, and quantile regression. We take up this discussion again in the laboratory, with practical examples of the differences between standard OLS estimation and robust regression in the presence of outliers.
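The Thursday and Friday tools can be previewed in a short sketch. Again this is illustrative rather than course code: splines and MASS ship with R, while quantile regression would require installing the quantreg package; all variables come from the built-in mtcars data.

```r
library(splines)  # ns(): natural cubic spline basis
library(MASS)     # rlm(): robust regression via M-estimation

# Two ways of modeling a non-linear effect of wt on mpg:
# a quadratic polynomial and a natural spline with 4 degrees of freedom
quad <- lm(mpg ~ poly(wt, 2), data = mtcars)
spl  <- lm(mpg ~ ns(wt, df = 4), data = mtcars)

# M-estimation: iteratively downweights outlying observations
rob <- rlm(mpg ~ wt + hp, data = mtcars)

# Quantile (median) regression would use the quantreg package, e.g.
# install.packages("quantreg")
# quantreg::rq(mpg ~ wt + hp, tau = 0.5, data = mtcars)
```

Comparing coef(rob) against the coefficients of the corresponding OLS fit, with and without outliers present, is essentially what the Friday lab does on real data.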
By the end of the course participants should be able to recognize the situations where OLS does not produce adequate estimates, and to identify the specific cause(s) for this breakdown. After this diagnosis, they should be able to either re-estimate using a more appropriate model specification, or to apply the needed corrections to the initial estimates. Finally, they should feel capable of interpreting the estimates from the revised models, and of summarizing the procedures they have implemented in a concise way.
The topic of how to go beyond the standard assumptions of OLS is much broader than what is covered here. Due to lack of time, I cannot offer even an overview of resampling methods (bootstrapping, jackknifing), or of how to correct for estimation problems arising from the absence of full information on some variables (missing data). Those interested in these topics for their research should aim for one of the other courses offered at either the Winter or the Summer School that deal with them specifically. Despite its close connection with the issues of heteroskedasticity and of varying effects across subpopulations, I also cannot cover multilevel modeling. Suitable courses on MLM are offered in both the Winter and Summer editions of the School, and I encourage those interested in the topic to take either of them.
Due to the advanced nature of the course, I expect participants to have a good knowledge of linear regression. They ought to be familiar with running such a regression, interpreting coefficients and standard errors, assessing model fit, and diagnosing problems with the estimation procedure. In addition to this statistical foundation, participants are also required to have a good working knowledge of the R statistical environment: common procedures such as reading in data, cleaning and recoding variables, running a regression in R, and manipulating the output object (e.g. extracting coefficients or standard errors).
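The output-manipulation skills referred to amount to operations like the following (a sketch on R's built-in mtcars data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)               # named vector of coefficients
sqrt(diag(vcov(fit)))   # standard errors, from the variance-covariance matrix
summary(fit)$r.squared  # one measure of model fit
confint(fit)            # confidence intervals (95% by default)
```

Participants who are comfortable reading and writing code of this sort should have no trouble with the lab sessions.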
Day | Topic | Details
---|---|---
Monday | OLS assumptions | We discuss regression assumptions, with particular emphasis on four: continuous predictors, normal distribution of errors, homoskedasticity, and linear relationships. Diagnostic tools for each of these four assumptions. The effect of assumption violations on estimates. In the lab, we cover these points in R, with particular focus on the diagnostic tools available.
Tuesday | Addressing heteroskedasticity: cluster-corrected SEs and WLS | Discussion of the impact of heteroskedasticity on OLS estimates. Cluster-corrected SEs (Huber–White) as a solution to heteroskedasticity. Where cluster-corrected SEs do not work. Weighted Least Squares, in cases of either known or unknown variance structure. In the lab we cover both strategies in R, with special attention given to cluster-corrected SEs.
Wednesday | Effect heterogeneity: interactions and fixed effects | Interactions in linear regression: two-way and three-way specifications. Overview of interpretation for different types of interaction: continuous × continuous, dichotomous × continuous, dichotomous × dichotomous. Graphical methods of presenting marginal effects from interactions. Interpreting main effects in linear models with interactions. Special case: fixed effects to model effect heterogeneity.
Thursday | Nonlinear regression: polynomials and splines | Non-linear relationships in OLS. Data transformations as a solution to non-linearity. Modeling non-linearity directly: (I) polynomials in regression. Modeling non-linearity directly: (II) regression splines.
Friday | Robust regression | The impact of outliers on OLS estimates. Diagnostics for outliers. Robust regression: (I) M estimation. Robust regression: (II) bounded-influence regression. Robust regression: (III) quantile regression.
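The Wednesday material on interpreting interactions can be sketched concisely: for a continuous × dichotomous interaction, the marginal effect of the continuous predictor at each level of the dichotomous one is recovered directly from the coefficient vector. A hedged illustration on built-in mtcars data (am is a 0/1 variable there; the variables carry no substantive meaning here):

```r
# Continuous (wt) x dichotomous (am) interaction
fit <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(fit)

# Marginal effect of wt at each level of am:
# when am = 0, it is just the "main effect" coefficient;
# when am = 1, the interaction term is added to it
me_wt_am0 <- unname(b["wt"])
me_wt_am1 <- unname(b["wt"] + b["wt:am"])
```

This is also why, as the course stresses, the coefficient on wt alone is not "the effect of wt": it is the effect only when am equals zero.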
Day | Readings
---|---
Monday | Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 12: “Diagnosing non-normality, nonconstant error variance, and nonlinearity” (pp. 267–306).
Tuesday | Wooldridge, J. M. (2013). Introductory Econometrics: A Modern Approach (5th ed.). Mason, OH: Cengage Learning. Chapter 8: “Heteroskedasticity” (pp. 268–302).
Wednesday | Kam, C. D., & Franzese Jr., R. J. (2007). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. Ann Arbor, MI: The University of Michigan Press. Chapter 3 (“Theory to practice”) and Chapter 4 (“The meaning, use, and abuse of some common general-practice rules”), pp. 13–102.
Thursday | Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 17: “Nonlinear regression” (pp. 451–475). Motulsky, H. J., & Ransnas, L. A. (1987). “Fitting curves to data using nonlinear regression: a practical and nonmathematical review.” The FASEB Journal, 1(5), 365–374.
Friday | Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. New York: Sage. Chapter 11: “Unusual and influential data” (pp. 241–266) and Chapter 19: “Robust regression” (pp. 530–547).
R version 3.3.2 (or newer).
RStudio version 1.0.136 (or newer).
At least an Intel Core 2 Duo processor and a minimum of 2 GB of RAM. Around 300–400 MB of free HDD space, for installing additional R packages and storing data. Any laptop bought after 2011 ought to be fine in terms of these minimum requirements.
Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Berry, W. D. (1993). Understanding Regression Assumptions. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage Publications.
Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.
Braumoeller, B. F. (2004). Hypothesis Testing and Multiplicative Interaction Terms. International Organization, 58(4), 807–820.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates. See particularly chapters 4, 6, 7, 9, and 10.
Hao, L., & Naiman, D. Q. (2007). Quantile Regression. London: Sage Publications.
Jaccard, J., & Turrisi, R. (2003). Interaction Effects in Multiple Regression (2nd ed.). London: Sage Publications.
Ritz, C., & Streibig, J. C. (2008). Nonlinear Regression with R. New York: Springer.
Ryan, T. P. (2008). Modern Regression Methods (2nd ed.). Hoboken, NJ: Wiley. See particularly chapters 2, 6, 8, 11, and 13.
Sheather, S. J. (2009). A Modern Approach to Regression with R. New York: Springer. See chapters 3 and 4.
Weisberg, S. (2005). Applied Linear Regression (3rd ed.). Hoboken, NJ: Wiley-Interscience. See chapter 11.